Assignment 3.3

Irfan Nur Afif (1035476)

Timothy Aerts (0756341)

Image Caption Retrieval Model

1. Data preprocessing

We will use the Microsoft COCO (Common Objects in Context) data set to train our "Image Caption Retrieval Model". This data set consists of precomputed 10-crop VGG19 features (Neural codes) and their corresponding text captions.

In [1]:
from __future__ import print_function

import os
import sys
import numpy as np
import pandas as pd
from collections import OrderedDict

DATA_PATH = 'data'
IMAGE_DATA= 'val2014'
EMBEDDING_PATH = 'embeddings'
MODEL_PATH = 'models'

You will need to create the above directories and place the provided data set in the 'data' directory.

Reading pairs of image (VGG19 features) and caption data

In [4]:
# DO NOT CHANGE BELOW CODE

import collections

np_train_data = np.load(os.path.join(DATA_PATH,'train_data.npy'))
np_val_data = np.load(os.path.join(DATA_PATH,'val_data.npy'))

train_data = collections.OrderedDict()
for i in range(len(np_train_data.item())):
    cap =  np_train_data.item()['caps']
    img =  np_train_data.item()['ims']
    train_data['caps'] = cap
    train_data['ims'] = img
    
val_data = collections.OrderedDict()
for i in range(len(np_val_data.item())):
    cap =  np_val_data.item()['caps']
    img =  np_val_data.item()['ims']
    val_data['caps'] = cap
    val_data['ims'] = img

Reading captions and information about their corresponding raw images from the Microsoft COCO website

In [5]:
# DO NOT CHANGE BELOW CODE
# use them for your own additional preprocessing step
# to map precomputed features and location of raw images 

import json

with open(os.path.join(DATA_PATH,'instances_val2014.json')) as json_file:
    coco_instances_val = json.load(json_file)
    
with open(os.path.join(DATA_PATH,'captions_val2014.json')) as json_file:
    coco_caption_val = json.load(json_file)

Additional preprocessing

In [6]:
# create your own function to map pairs of precomputed features and filepath of raw images
# this will be used later for visualization part
# simple approach: based on matched text caption (see json file)

# YOUR CODE HERE 



#todo: mapping

######### HELPER FUNCTION

import re



# normalize a caption: strip non-alphanumeric characters and lowercase
def clean_cap(caption):
    return re.sub('[^0-9a-zA-Z]+', ' ', caption).lower().strip()

# encode a tokenized caption into a fixed-length (50) integer array;
# unknown words are mapped to index 0
def arrstrcap2arrintcap(arrcaption):
    arr = np.zeros(50, dtype=int)
    for idx, word in enumerate(arrcaption):
        try:
            arr[idx] = words_indices[word]
        except KeyError:
            arr[idx] = 0
    return arr
    
def imgid2cap(imgid):
    cap=[]
    for i in coco_caption_val['annotations']:
        if(i['image_id']==imgid):
            cap.append(i['caption'])
    if(len(cap)==0):
        raise Exception('Caption not found!')
    return cap

def cap2imgid(cap):
    imgid=-1
    for i in coco_caption_val['annotations']:
        if(i['caption']==cap):
            imgid=i['image_id']
    if(imgid==-1):
        raise Exception('Image not found!')
    return imgid


# given an index into x_caption, find the matching raw image and display it
def mapto(xtrainid):
    guessed_caption = true_caption(x_caption[xtrainid])
    img_id = -1
    for i in coco_caption_val['annotations']:
        capt = re.sub('[^0-9a-zA-Z]+', ' ', i['caption']).lower().strip()
        if capt == guessed_caption:
            img_id = i['image_id']
            break
    for i in coco_instances_val['images']:
        if i['id'] == img_id:
            show_img(i['file_name'])

#translate array of int to a sentence
#input: np array of caption (encoded in [int])
#output: string
def true_caption(cap):
    caplist=[indices_words[i] for i in cap]
    strcap=""
    for i in caplist:
        if i!='<pad>' and i!='<unk>':
            strcap+=i+' '
    return strcap.strip()


def find_original_caption(image_id):
    arr=['' for i in range(5)]
    idx=0
    for i in coco_caption_val['annotations']:
        if(i['image_id']==image_id):
            arr[idx]=i['caption']
            idx+=1
    return arr


Build vocabulary index

In [7]:
# DO NOT CHANGE BELOW CODE

def build_dictionary(text):

    wordcount = OrderedDict()
    for cc in text:
        words = cc.split()
        for w in words:
            if w not in wordcount:
                wordcount[w] = 0
            wordcount[w] += 1
    words = list(wordcount.keys())
    freqs = list(wordcount.values())
    sorted_idx = np.argsort(freqs)[::-1]
    

    worddict = OrderedDict()
    worddict['<pad>'] = 0
    worddict['<unk>'] = 1
    for idx, sidx in enumerate(sorted_idx):
        worddict[words[sidx]] = idx+2  # 0: <pad>, 1: <unk>
    

    return worddict

# use the resulting vocabulary index as your look up dictionary
# to transform raw text into integer sequences

all_captions = []
all_captions = train_data['caps'] + val_data['caps']

# decode bytes to string format
caps = []
for w in all_captions:
    caps.append(w.decode())
    
words_indices = build_dictionary(caps)
print ('Dictionary size: ' + str(len(words_indices)))
indices_words = dict((v,k) for (k,v) in words_indices.items())

Dictionary size: 11473
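As a sanity check on how the vocabulary index is used, here is a self-contained sketch with a hypothetical toy dictionary in the same format as `words_indices`: indices 0 and 1 are reserved for `<pad>` and `<unk>`, frequent words get low indices, and `indices_words` inverts the mapping.

```python
# toy stand-in for words_indices: 0 and 1 are reserved, real words start at 2
toy_words_indices = {'<pad>': 0, '<unk>': 1, 'a': 2, 'dog': 3, 'runs': 4}
toy_indices_words = dict((v, k) for (k, v) in toy_words_indices.items())

# encode a caption: out-of-vocabulary words fall back to <unk> (index 1)
encoded = [toy_words_indices.get(w, 1) for w in 'a dog flies'.split()]
print(encoded)                                  # [2, 3, 1]
print([toy_indices_words[i] for i in encoded])  # ['a', 'dog', '<unk>']
```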
In [2]:
from keras.layers import Dense, Embedding,Input,LSTM,GRU,Lambda,add,dot,subtract, maximum
from keras.models import Model
import keras.backend as K
c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.

2. Image - Caption Retrieval Model

Image model

In [12]:
# YOUR CODE HERE 
from keras.layers import Dense, Embedding,Dot,Input,LSTM,GRU,Add, Subtract, concatenate

#image network
img_input = Input(shape=(4096,),name='IMG_input')
condense_img = Dense(1024,name='Dense_IMG')(img_input)

Caption model

In [13]:
import gensim
from gensim.models import KeyedVectors
path = ".."

#convert GloVe into word2vec format
#gensim.scripts.glove2word2vec.get_glove_info(path)
#gensim.scripts.glove2word2vec.glove2word2vec(path, "glove_converted.txt")

glove = KeyedVectors.load_word2vec_format("../glove_converted.txt", binary=False)
c:\users\illia\appdata\local\conda\conda\envs\tensorflow-gpu\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
In [14]:
# YOUR CODE HERE

voc_size = len(indices_words)  # roughly 11.5k words
cap_size = 50

caption_input = Input(shape=(cap_size,),name='CAP_input')
noise_input = Input(shape=(cap_size,),name='Noise_input')

vocab_dim = 300 # dimensionality of your word vectors
n_symbols = voc_size + 1 # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in words_indices.items():
    try:
        embedding_weights[index, :] = glove[word]
    except KeyError:
        embedding_weights[index, :] = np.zeros(vocab_dim)

# define the frozen embedding layer (outside the loop!) and load the GloVe weights
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=False)
embedding_layer.build((None,)) # if you don't do this, set_weights below won't work
embedding_layer.set_weights([embedding_weights])


recurrent_layer = LSTM(1024,name='recurrent_layer')


#inputs into shared layers
embed_caption = embedding_layer(caption_input)
embed_noise = embedding_layer(noise_input)

recurrent_noise = recurrent_layer(embed_noise)
recurrent_caption = recurrent_layer(embed_caption)

Join model

In [15]:
# YOUR CODE HERE



#noise and real score
cap_image = dot([condense_img,recurrent_caption],1,normalize=True, name='DotProd_postive_score')
noise_image = dot([condense_img,recurrent_noise],1,normalize=True, name='DotProd_negative_score')
conc = concatenate([cap_image,noise_image],axis=-1)
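The `dot` layer with `normalize=True` computes the cosine similarity between the 1024-d image vector and the 1024-d caption vector. A NumPy sketch of what each score is (the toy vectors are hypothetical):

```python
import numpy as np

def cosine_score(img_vec, cap_vec):
    # equivalent of keras.layers.dot([img, cap], axes=1, normalize=True)
    img_n = img_vec / np.linalg.norm(img_vec)
    cap_n = cap_vec / np.linalg.norm(cap_vec)
    return float(np.dot(img_n, cap_n))

img = np.array([1.0, 0.0])
cap = np.array([1.0, 0.0])
noise = np.array([0.0, 1.0])
print(cosine_score(img, cap))    # 1.0 for identical directions
print(cosine_score(img, noise))  # 0.0 for orthogonal vectors
```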

Main model for training stage

In [16]:
# YOUR CODE HERE

# define your model input and output
print ("loading the training model")
training_model = Model(inputs=[img_input,caption_input,noise_input],outputs=conc)
training_model.summary()
loading the training model
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
CAP_input (InputLayer)          (None, 50)           0                                            
__________________________________________________________________________________________________
Noise_input (InputLayer)        (None, 50)           0                                            
__________________________________________________________________________________________________
IMG_input (InputLayer)          (None, 4096)         0                                            
__________________________________________________________________________________________________
embedding_11473 (Embedding)     (None, 50, 300)      3442200     CAP_input[0][0]                  
                                                                 Noise_input[0][0]                
__________________________________________________________________________________________________
Dense_IMG (Dense)               (None, 1024)         4195328     IMG_input[0][0]                  
__________________________________________________________________________________________________
recurrent_layer (LSTM)          (None, 1024)         5427200     embedding_11473[1][0]            
                                                                 embedding_11473[0][0]            
__________________________________________________________________________________________________
DotProd_postive_score (Dot)     (None, 1)            0           Dense_IMG[0][0]                  
                                                                 recurrent_layer[1][0]            
__________________________________________________________________________________________________
DotProd_negative_score (Dot)    (None, 1)            0           Dense_IMG[0][0]                  
                                                                 recurrent_layer[0][0]            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 2)            0           DotProd_postive_score[0][0]      
                                                                 DotProd_negative_score[0][0]     
==================================================================================================
Total params: 13,064,728
Trainable params: 9,622,528
Non-trainable params: 3,442,200
__________________________________________________________________________________________________

Retrieval model

In [17]:
# YOUR CODE HERE

# define your model input and output

print ("loading sub-models for retrieving Neural codes")

caption_model = Model(inputs=caption_input, outputs=recurrent_caption)
caption_model.summary()


image_model = Model(inputs=img_input, outputs=condense_img)
image_model.summary()
loading sub-models for retrieving Neural codes
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
CAP_input (InputLayer)       (None, 50)                0         
_________________________________________________________________
embedding_11473 (Embedding)  (None, 50, 300)           3442200   
_________________________________________________________________
recurrent_layer (LSTM)       (None, 1024)              5427200   
=================================================================
Total params: 8,869,400
Trainable params: 5,427,200
Non-trainable params: 3,442,200
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
IMG_input (InputLayer)       (None, 4096)              0         
_________________________________________________________________
Dense_IMG (Dense)            (None, 1024)              4195328   
=================================================================
Total params: 4,195,328
Trainable params: 4,195,328
Non-trainable params: 0
_________________________________________________________________

Loss function

We define our loss function as a max-margin loss: it pushes the score of a positive (matching) pair to exceed the score of a negative pair by a margin of at least 1. If we call $p_i$ the score of the positive pair of the $i$-th example, and $n_i$ the score of the negative pair of that example, the loss is:

\begin{equation*} loss = \sum_i{\max(0, 1 - p_i + n_i)} \end{equation*}
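A NumPy sketch of this loss on a toy batch of (positive, negative) score pairs, to make the margin behavior concrete (the score values are hypothetical):

```python
import numpy as np

def max_margin_loss_np(scores):
    # scores[:, 0] = positive-pair score p_i, scores[:, 1] = negative-pair score n_i
    return np.sum(np.maximum(0.0, 1.0 - scores[:, 0] + scores[:, 1]))

scores = np.array([[0.9, -0.5],   # margin already > 1: contributes 0
                   [0.2,  0.1]])  # small margin: contributes 1 - 0.2 + 0.1
print(max_margin_loss_np(scores))  # approximately 0.9
```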

In [8]:
from keras import backend as K


def max_margin_loss(y_true, y_pred):
    print(y_pred.shape)
    return K.sum(K.maximum(0.0, 1.0 - y_pred[:,0] + y_pred[:,1]))

   

Accuracy metric for max-margin loss

In what fraction of examples does the positive pair score higher than the negative pair?

In [9]:
# YOUR CODE HERE

def accuracy(y_true, y_pred):
    # YOUR CODE HERE
    # fraction of examples where the positive pair scores above the negative pair
    accuracy_ = K.mean(K.cast(K.greater(y_pred[:, 0], y_pred[:, 1]), 'float32'))
    return accuracy_

Compile model

In [20]:
# DO NOT CHANGE BELOW CODE
print ("compiling the training model")
training_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])
image_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])
caption_model.compile(optimizer='adam', loss=max_margin_loss, metrics=[accuracy])
#training_model.compile(optimizer='adam', loss=max_margin_loss)
compiling the training model
(?, 2)
(?, 1024)
(?, 1024)

3. Data preparation for training the model

  • adjust the length of captions to a fixed maximum length (50 words)
  • sample one caption for each image, while shuffling the image data
  • encode captions into integer sequences based on the look-up vocabulary index
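The steps above can be sketched end-to-end on a single toy caption (the toy vocabulary is hypothetical; the real `words_indices` is built earlier):

```python
import numpy as np

toy_words_indices = {'<pad>': 0, '<unk>': 1, 'a': 2, 'dog': 3, 'runs': 4}

def encode_caption(caption, words_indices, max_len=50):
    # map each token to its integer id (unknown words -> <unk> = 1),
    # then pad/truncate to a fixed length with <pad> = 0
    ids = [words_indices.get(w, 1) for w in caption.lower().split()]
    seq = np.zeros(max_len, dtype=int)
    n = min(max_len, len(ids))
    seq[:n] = ids[:n]
    return seq

seq = encode_caption('A dog runs fast', toy_words_indices)
print(seq[:6])  # [2 3 4 1 0 0] -> 'fast' is unknown, the tail is padding
```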
In [10]:
# sample one caption per image (each image has 5 consecutive captions)
# return image_ids, caption_ids


# YOUR CODE HERE

def sampling_img_cap(data):
    datalen = len(data['ims'])
    image_ids = np.arange(datalen)
    np.random.shuffle(image_ids)
    # captions 5*i .. 5*i+4 belong to image i; pick one of the five at random
    caption_ids = [image_ids[x]*5 + np.random.randint(0, 5) for x in range(datalen)]

    return image_ids, caption_ids

In [11]:
# transform raw text captions into integer sequences of fixed maximum length


# pad/truncate an integer sequence to a fixed length of 50
def make_50(arr):
    return_arr = np.zeros(50, dtype=int)
    limit_id = min(50, len(arr))
    for i in range(limit_id):
        return_arr[i] = arr[i]
    return return_arr

def prepare_caption(caption_ids, caption_data):

    # YOUR CODE HERE
    datalen = len(caption_ids)
    cap_transformed = [caption_data[caption_ids[x]] for x in range(datalen)]
    caption_seqs = [[words_indices[word] for word in sentence.split()] for sentence in cap_transformed]
    caption_seqs = np.asarray([make_50(i) for i in caption_seqs])

    return caption_seqs

In [12]:
# DO NOT CHANGE BELOW CODE

train_caps = []
for cap in train_data['caps']:
    train_caps.append(cap.decode())

val_caps = []
for cap in val_data['caps']:
    val_caps.append(cap.decode())
In [14]:
# DO NOT CHANGE BELOW CODE

train_image_ids, train_caption_ids = sampling_img_cap(train_data)
val_image_ids, val_caption_ids = sampling_img_cap(val_data)

x_caption = prepare_caption(train_caption_ids, train_caps)
x_image = train_data['ims'][np.array(train_image_ids)]

x_val_caption = prepare_caption(val_caption_ids, val_caps)
x_val_image = val_data['ims'][np.array(val_image_ids)]

4. Create noise set for negative examples of image-fake caption and dummy output

Notice that we do not have real labeled output for training the model. Keras expects labels, so we need to create a dummy output: a NumPy array of zeros. These dummy labels are never used, since the loss is computed from the margin between positive examples (image with its real caption) and negative examples (image with a fake caption).
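A minimal sketch of the negative-sampling idea on toy data: copy the encoded captions and shuffle the copy, so each image is paired with a (very likely) mismatched caption, while the dummy labels are simply zeros of batch size.

```python
import numpy as np

rng = np.random.RandomState(0)
x_caption_toy = np.arange(12).reshape(4, 3)  # 4 toy encoded captions

noise = np.copy(x_caption_toy)
rng.shuffle(noise)                  # shuffle rows in place: mismatched pairs
dummy_labels = np.zeros(len(x_caption_toy))  # ignored by the max-margin loss

print(noise.shape, dummy_labels.shape)  # (4, 3) (4,)
```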

In [15]:
# YOUR CODE HERE

train_noise = np.copy(x_caption)      # negatives: a shuffled copy of the real captions
val_noise = np.copy(x_val_caption)

np.random.shuffle(train_noise)
np.random.shuffle(val_noise)
y_train_labels = np.zeros(len(x_image))      # 10000 dummy labels
y_val_labels = np.zeros(len(x_val_image))    # 5000 dummy labels

5. Training model

In [15]:
# YOUR CODE HERE

X_train = [x_image,x_caption,train_noise]
Y_train = y_train_labels
X_valid = [x_val_image,x_val_caption,val_noise]
Y_valid = y_val_labels

We run the model with epochs=1 inside a loop, reshuffling the noise set before every training cycle, and save the model every 20 iterations.

In [29]:
for i in range(20):
    np.random.shuffle(train_noise)
    np.random.shuffle(val_noise)
    X_train = [x_image,x_caption,train_noise]
    X_valid = [x_val_image,x_val_caption,val_noise]
    training_model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=100, epochs=1)
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 88.9186 - accuracy: 0.6275 - val_loss: 72.0004 - val_accuracy: 0.7072
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 59.8679 - accuracy: 0.7667 - val_loss: 55.7581 - val_accuracy: 0.7884
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 49.6947 - accuracy: 0.8059 - val_loss: 47.8298 - val_accuracy: 0.8282
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 46.5991 - accuracy: 0.8306 - val_loss: 46.8327 - val_accuracy: 0.8154
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 44.3766 - accuracy: 0.8471 - val_loss: 43.3367 - val_accuracy: 0.8646
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 53.5498 - accuracy: 0.8065 - val_loss: 52.3378 - val_accuracy: 0.8014
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 47.2686 - accuracy: 0.8425 - val_loss: 43.2155 - val_accuracy: 0.8574
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 40.9306 - accuracy: 0.8658 - val_loss: 39.4020 - val_accuracy: 0.8708
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 48.7477 - accuracy: 0.8486 - val_loss: 47.9059 - val_accuracy: 0.8422
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 37.8086 - accuracy: 0.8819 - val_loss: 36.8468 - val_accuracy: 0.8732
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 33.4699 - accuracy: 0.8964 - val_loss: 34.9124 - val_accuracy: 0.8910
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 43.4131 - accuracy: 0.8511 - val_loss: 48.3337 - val_accuracy: 0.8362
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 39.2721 - accuracy: 0.8654 - val_loss: 42.3388 - val_accuracy: 0.8558
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 38.0815 - accuracy: 0.8764 - val_loss: 45.6302 - val_accuracy: 0.8492
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 36.2295 - accuracy: 0.8879 - val_loss: 38.2008 - val_accuracy: 0.8696
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 34.8432 - accuracy: 0.8850 - val_loss: 35.7605 - val_accuracy: 0.8814
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 32.2876 - accuracy: 0.8984 - val_loss: 35.0408 - val_accuracy: 0.8850
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 33.1920 - accuracy: 0.8976 - val_loss: 35.0746 - val_accuracy: 0.8836
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 31.7615 - accuracy: 0.8973 - val_loss: 33.3590 - val_accuracy: 0.8962
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 84s 8ms/step - loss: 31.0952 - accuracy: 0.9033 - val_loss: 35.8020 - val_accuracy: 0.8806

Storing the models and weight parameters after 20 iterations

In [ ]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'20iter_image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, '20iter_weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'20iter_caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'20iter_image_model.h5'))
In [31]:
for i in range(20):
    np.random.shuffle(train_noise)
    np.random.shuffle(val_noise)
    X_train = [x_image,x_caption,train_noise]
    X_valid = [x_val_image,x_val_caption,val_noise]
    training_model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=100, epochs=1)
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 31.8029 - accuracy: 0.9009 - val_loss: 33.6493 - val_accuracy: 0.8886
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 30.8591 - accuracy: 0.9021 - val_loss: 33.4737 - val_accuracy: 0.8932
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 29.0788 - accuracy: 0.9130 - val_loss: 30.0855 - val_accuracy: 0.8986
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 28.1488 - accuracy: 0.9167 - val_loss: 30.8889 - val_accuracy: 0.9010
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 28.1194 - accuracy: 0.9191 - val_loss: 30.9063 - val_accuracy: 0.9018
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 27.5284 - accuracy: 0.9184 - val_loss: 30.7231 - val_accuracy: 0.9060
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 26.7888 - accuracy: 0.9272 - val_loss: 31.0101 - val_accuracy: 0.9040
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 26.5587 - accuracy: 0.9227 - val_loss: 30.2624 - val_accuracy: 0.9014
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 26.0613 - accuracy: 0.9258 - val_loss: 30.5853 - val_accuracy: 0.9000
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 26.1716 - accuracy: 0.9276 - val_loss: 30.6072 - val_accuracy: 0.9076
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 26.2908 - accuracy: 0.9257 - val_loss: 30.8413 - val_accuracy: 0.9054
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.1448 - accuracy: 0.9306 - val_loss: 31.0820 - val_accuracy: 0.9072
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.0128 - accuracy: 0.9351 - val_loss: 29.0238 - val_accuracy: 0.9172
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.4169 - accuracy: 0.9351 - val_loss: 29.7360 - val_accuracy: 0.9078
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.0697 - accuracy: 0.9338 - val_loss: 30.6383 - val_accuracy: 0.9108
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.2010 - accuracy: 0.9398 - val_loss: 30.8700 - val_accuracy: 0.9064
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.7912 - accuracy: 0.9343 - val_loss: 30.4451 - val_accuracy: 0.9130
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.6140 - accuracy: 0.9353 - val_loss: 30.6613 - val_accuracy: 0.9116
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.7096 - accuracy: 0.9373 - val_loss: 29.9828 - val_accuracy: 0.9228
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 28.1223 - accuracy: 0.9321 - val_loss: 34.8587 - val_accuracy: 0.8934

Storing the models and weight parameters after 40 iterations

In [32]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'40iter_image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, '40iter_weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'40iter_caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'40iter_image_model.h5'))
In [33]:
for i in range(20):
    np.random.shuffle(train_noise)
    np.random.shuffle(val_noise)
    X_train = [x_image,x_caption,train_noise]
    X_valid = [x_val_image,x_val_caption,val_noise]
    training_model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=100, epochs=1)
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 35.4887 - accuracy: 0.9122 - val_loss: 43.0359 - val_accuracy: 0.8714
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 39.7188 - accuracy: 0.8941 - val_loss: 38.1459 - val_accuracy: 0.8872
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 37.5093 - accuracy: 0.8998 - val_loss: 37.6459 - val_accuracy: 0.8926
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 36.5781 - accuracy: 0.9063 - val_loss: 39.8724 - val_accuracy: 0.8840
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 35.8946 - accuracy: 0.9067 - val_loss: 36.7916 - val_accuracy: 0.8976
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 32.9292 - accuracy: 0.9141 - val_loss: 34.5532 - val_accuracy: 0.9028
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 31.2582 - accuracy: 0.9192 - val_loss: 35.6500 - val_accuracy: 0.8934
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 30.7882 - accuracy: 0.9161 - val_loss: 34.8685 - val_accuracy: 0.9016
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 33.4218 - accuracy: 0.9187 - val_loss: 35.2201 - val_accuracy: 0.9024
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 36.4264 - accuracy: 0.9111 - val_loss: 50.5845 - val_accuracy: 0.8580
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 39.8223 - accuracy: 0.8961 - val_loss: 39.4282 - val_accuracy: 0.8838
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 36.3748 - accuracy: 0.9122 - val_loss: 35.9522 - val_accuracy: 0.8932
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 32.0661 - accuracy: 0.9144 - val_loss: 32.8601 - val_accuracy: 0.9066
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 29.4459 - accuracy: 0.9239 - val_loss: 39.9438 - val_accuracy: 0.8862
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 28.7025 - accuracy: 0.9249 - val_loss: 30.4497 - val_accuracy: 0.9126
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 26.8221 - accuracy: 0.9289 - val_loss: 28.8954 - val_accuracy: 0.9204
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 26.4795 - accuracy: 0.9353 - val_loss: 33.1146 - val_accuracy: 0.8952
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 26.7862 - accuracy: 0.9322 - val_loss: 31.1030 - val_accuracy: 0.9130
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 26.8123 - accuracy: 0.9348 - val_loss: 30.0436 - val_accuracy: 0.9120
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 25.2695 - accuracy: 0.9399 - val_loss: 30.9186 - val_accuracy: 0.9102

Storing the models and weight parameters after 60 iterations

In [34]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'60iter_image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, '60iter_weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'60iter_caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'60iter_image_model.h5'))
In [16]:
for i in range(20):
    np.random.shuffle(train_noise)
    np.random.shuffle(val_noise)
    X_train = [x_image,x_caption,train_noise]
    X_valid = [x_val_image,x_val_caption,val_noise]
    training_model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=100, epochs=1)
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 27.9112 - acc: 0.9223 - val_loss: 29.1141 - val_acc: 0.9238
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 25.6556 - acc: 0.9346 - val_loss: 29.2866 - val_acc: 0.9172
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 25.4950 - acc: 0.9326 - val_loss: 29.4254 - val_acc: 0.9196
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 25.4173 - acc: 0.9360 - val_loss: 28.2469 - val_acc: 0.9244
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 25.7053 - acc: 0.9386 - val_loss: 30.0783 - val_acc: 0.9124
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 25.3843 - acc: 0.9423 - val_loss: 28.4156 - val_acc: 0.9264
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 24.9105 - acc: 0.9426 - val_loss: 27.3250 - val_acc: 0.9298
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.2559 - acc: 0.9419 - val_loss: 27.3747 - val_acc: 0.9304
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 26.3785 - acc: 0.9409 - val_loss: 28.9593 - val_acc: 0.9244
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 24.5036 - acc: 0.9446 - val_loss: 28.8915 - val_acc: 0.9190
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 25.3832 - acc: 0.9381 - val_loss: 28.6401 - val_acc: 0.9220
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 25.6813 - acc: 0.9360 - val_loss: 28.0302 - val_acc: 0.9238
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.9288 - acc: 0.9396 - val_loss: 28.2650 - val_acc: 0.9206
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 23.7961 - acc: 0.9445 - val_loss: 29.0840 - val_acc: 0.9226
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 83s 8ms/step - loss: 24.0325 - acc: 0.9444 - val_loss: 28.1972 - val_acc: 0.9238
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 23.5293 - acc: 0.9468 - val_loss: 29.2705 - val_acc: 0.9222
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 23.1189 - acc: 0.9501 - val_loss: 28.2344 - val_acc: 0.9248
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 22.3297 - acc: 0.9516 - val_loss: 29.0275 - val_acc: 0.9290
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 23.1193 - acc: 0.9489 - val_loss: 26.8763 - val_acc: 0.9360
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 78s 8ms/step - loss: 21.8457 - acc: 0.9535 - val_loss: 28.3960 - val_acc: 0.9262

Storing 80 iter models and weight parameters

In [17]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'80iter_image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, '80iter_weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'80iter_caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'80iter_image_model.h5'))
In [15]:
for i in range(20):
    np.random.shuffle(train_noise)
    np.random.shuffle(val_noise)
    X_train = [x_image,x_caption,train_noise]
    X_valid = [x_val_image,x_val_caption,val_noise]
    training_model.fit(X_train,Y_train, validation_data=(X_valid, Y_valid), batch_size=100, epochs=1)
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 23.8784 - acc: 0.9452 - val_loss: 28.5689 - val_acc: 0.9246
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 23.1777 - acc: 0.9503 - val_loss: 27.7301 - val_acc: 0.9324
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 24.0717 - acc: 0.9428 - val_loss: 28.2736 - val_acc: 0.9228
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 24.8452 - acc: 0.9423 - val_loss: 28.4980 - val_acc: 0.9298
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 23.5946 - acc: 0.9511 - val_loss: 27.4053 - val_acc: 0.9324
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 22.4615 - acc: 0.9539 - val_loss: 28.0386 - val_acc: 0.9278
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 22.4470 - acc: 0.9522 - val_loss: 28.0424 - val_acc: 0.9288
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 80s 8ms/step - loss: 21.2503 - acc: 0.9578 - val_loss: 27.3800 - val_acc: 0.9362
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 21.1884 - acc: 0.9576 - val_loss: 27.1148 - val_acc: 0.9298
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 21.2396 - acc: 0.9606 - val_loss: 27.8228 - val_acc: 0.9348
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 20.8207 - acc: 0.9570 - val_loss: 26.5645 - val_acc: 0.9370
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 21.1339 - acc: 0.9572 - val_loss: 27.1834 - val_acc: 0.9358
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 20.8824 - acc: 0.9588 - val_loss: 26.8585 - val_acc: 0.9316
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 21.1073 - acc: 0.9622 - val_loss: 26.5123 - val_acc: 0.9370
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 20.6118 - acc: 0.9639 - val_loss: 26.9167 - val_acc: 0.9342
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 20.1303 - acc: 0.9644 - val_loss: 27.6342 - val_acc: 0.9336
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 20.0932 - acc: 0.9647 - val_loss: 27.3283 - val_acc: 0.9392
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 81s 8ms/step - loss: 20.0959 - acc: 0.9621 - val_loss: 26.4353 - val_acc: 0.9392
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 82s 8ms/step - loss: 19.1649 - acc: 0.9681 - val_loss: 26.3438 - val_acc: 0.9386
Train on 10000 samples, validate on 5000 samples
Epoch 1/1
10000/10000 [==============================] - 79s 8ms/step - loss: 18.9599 - acc: 0.9664 - val_loss: 27.6475 - val_acc: 0.9318

Storing 100 iter models and weight parameters

In [16]:
# DO NOT CHANGE BELOW CODE

# Save model
training_model.save(os.path.join(MODEL_PATH,'100iter_image_caption_model.h5'))
# Save weight parameters
training_model.save_weights(os.path.join(MODEL_PATH, '100iter_weights_image_caption.hdf5'))

# Save model for encoding caption and image
caption_model.save(os.path.join(MODEL_PATH,'100iter_caption_model.h5'))
image_model.save(os.path.join(MODEL_PATH,'100iter_image_model.h5'))
In [126]:
training_model.summary()
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
CAP_input (InputLayer)          (None, 50)           0                                            
__________________________________________________________________________________________________
Noise_input (InputLayer)        (None, 50)           0                                            
__________________________________________________________________________________________________
IMG_input (InputLayer)          (None, 4096)         0                                            
__________________________________________________________________________________________________
embedding_45892 (Embedding)     (None, 50, 300)      3442200     CAP_input[0][0]                  
                                                                 Noise_input[0][0]                
__________________________________________________________________________________________________
Dense_IMG (Dense)               (None, 1024)         4195328     IMG_input[0][0]                  
__________________________________________________________________________________________________
recurrent_layer (LSTM)          (None, 1024)         5427200     embedding_45892[1][0]            
                                                                 embedding_45892[0][0]            
__________________________________________________________________________________________________
DotProd_postive_score (Dot)     (None, 1)            0           Dense_IMG[0][0]                  
                                                                 recurrent_layer[1][0]            
__________________________________________________________________________________________________
DotProd_negative_score (Dot)    (None, 1)            0           Dense_IMG[0][0]                  
                                                                 recurrent_layer[0][0]            
__________________________________________________________________________________________________
concatenate_5 (Concatenate)     (None, 2)            0           DotProd_postive_score[0][0]      
                                                                 DotProd_negative_score[0][0]     
==================================================================================================
Total params: 13,064,728
Trainable params: 9,622,528
Non-trainable params: 3,442,200
__________________________________________________________________________________________________

Loading models and weight parameters (currently used: the 100-iteration model)

In [16]:
from keras.models import load_model


training_model=load_model(os.path.join(MODEL_PATH,'100iter_image_caption_model.h5'), custom_objects={'max_margin_loss': max_margin_loss})

# Load model for encoding caption and image
caption_model=load_model(os.path.join(MODEL_PATH,'100iter_caption_model.h5'), custom_objects={'max_margin_loss': max_margin_loss})
image_model=load_model(os.path.join(MODEL_PATH,'100iter_image_model.h5'), custom_objects={'max_margin_loss': max_margin_loss})
(?, 2)
(?, 1024)
(?, 1024)

6. Feature extraction (Neural codes)

In [17]:
# YOUR CODE HERE

# Use caption_model and image_model to produce "Neural codes" 
# for both image and caption from validation set

img_model_nc = Model(inputs=image_model.input, outputs=image_model.get_layer("Dense_IMG").output)
cap_model_nc = Model(inputs=caption_model.input, outputs=caption_model.get_layer("recurrent_layer").get_output_at(0))
In [18]:
#nc_img = img_model_nc.predict(np.append(x_image,x_val_image,axis=0))
nc_img = img_model_nc.predict(x_image)
nc_img_val = img_model_nc.predict(x_val_image)
nc_cap = cap_model_nc.predict(x_caption)
nc_cap_val = cap_model_nc.predict(x_val_caption)

# print the shapes to confirm all features are 1024-dimensional
print(nc_img.shape)
print(nc_img_val.shape)
print(nc_cap.shape)
print(nc_cap_val.shape)
(10000, 1024)
(5000, 1024)
(10000, 1024)
(5000, 1024)

7. Caption Retrieval

Display original image as query and its ground truth caption

In [19]:
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing import image
In [20]:
def show_img(imagename):
    img = image.load_img(os.path.join(IMAGE_DATA,imagename), target_size=(224,224))
    plt.imshow(img)
    plt.axis("off")
    plt.show()
In [3]:
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input, decode_predictions
vgg_model = VGG16(weights='imagenet')
fc2_model = Model(inputs=vgg_model.input, outputs=vgg_model.get_layer("fc2").output)

def load_img_preprocess(img_path):
    img = image.load_img(img_path, target_size=(224, 224))
    array = image.img_to_array(img)
    x = np.expand_dims(array, axis=0)
    x = preprocess_input(x)
    return {"img": img, "array": array, "x": x}

#elephant1 = load_img_preprocess('val2014/COCO_val2014_000000000073.jpg')

def show_image_predictions(img_obj):
    plt.imshow(img_obj["img"])
    plt.show()
    preds = vgg_model.predict(img_obj["x"])
    preds_dec = decode_predictions(preds, top=5)[0]
    print("Predictions:")
    for pred in preds_dec:
        print("{}, with probability: {}".format(pred[1],pred[2]))
    print("")
#show_image_predictions(elephant1)

def include_features(img_obj):
    img_obj["fc2"] = fc2_model.predict(img_obj["x"])
    
#include_features(elephant1)


def get_features(imgname):
    # load + preprocess the image, then return its 4096-d fc2 feature
    img_obj = load_img_preprocess('val2014/' + imgname)
    include_features(img_obj)
    return np.array(img_obj["fc2"])
    
#get_features('COCO_val2014_000000000073.jpg')
In [68]:
# YOUR CODE HERE

# choose one image_id from validation set
# use this id to get filepath of image
img_id = 33499
filepath_image = 'COCO_val2014_000000033499.jpg' 

# display original caption
original_caption = find_original_caption(img_id)
print(original_caption)
# DO NOT CHANGE BELOW CODE
#show_img(filepath_image)
img = image.load_img(os.path.join(IMAGE_DATA,filepath_image), target_size=(224,224))
plt.imshow(img)
plt.axis("off")
plt.show()
['Closeup of a black and gold clock with blue sky in background.', 'a clock on a pole with a sky background ', 'A clock tower reads the time as 3:00.', 'Old clock showing the time against the sky in the afternoon.', 'An ornate outdoor clock with streetlamps above and below it.']
In [39]:
# function to retrieve caption, given an image query
from sklearn.neighbors import NearestNeighbors

def samearr(a, b):
    # True if the two sequences are element-wise identical
    return all(a[i] == b[i] for i in range(len(a)))

def get_caption(image_filename, n=10):
    feature = get_features(image_filename)

    show_img(image_filename)

    # look up the image id in the COCO metadata
    img_id = -1
    for img in coco_instances_val['images']:
        if img['file_name'] == image_filename:
            img_id = img['id']
    if img_id == -1:
        raise Exception('Pic metadata (json file) not found!')
    orig_cap = imgid2cap(img_id)
    print('original caption:')
    print(orig_cap)

    # map the 4096-d VGG feature to the 1024-d neural code, then find
    # the n nearest caption neural codes under L2 distance
    rep1024 = img_model_nc.predict(feature)
    neigh = NearestNeighbors(n_neighbors=n, p=2)
    neigh.fit(nc_cap)
    nn = neigh.kneighbors(rep1024)
    for i in range(n):
        print("guessed cap: {}, distance={}".format(true_caption(x_caption[nn[1][0][i]]), nn[0][0][i]))
    return nn

In [78]:
get_caption('COCO_val2014_000000504439.jpg')
original caption:
['Two zebras, one walking toward the camera, one walking away.', 'a couple of zebras are standing in a pin', 'Two zebras that are facing different directions in an enclosure.', 'Too zebras stand together near a wooded area.', 'two zebras going opposite directions of each other in outdoor enclosure']
guessed cap: a group of zebras under a huge shade tree in the middle of a grassy field, distance=579.3861066522182
guessed cap: a couple of zebras are standing in a field, distance=579.3988857603442
guessed cap: one adult zebra and one young zebra walking through the grass, distance=579.4034436081217
guessed cap: a couple of cows tied to a fence in a city area, distance=579.4048080549875
guessed cap: a couple of elephants fighting in a grass area, distance=579.4094073301742
guessed cap: a couple of brown horses walking down a street next to buildings, distance=579.4130759595523
guessed cap: several adult elephants and a young elephant in a field of grass, distance=579.4203795988376
guessed cap: zebras are grazing in the grass with a mountain in the distance, distance=579.4214221167928
guessed cap: horses standing and laying in grass near a body of water and a mountain, distance=579.4215182734806
guessed cap: a couple of zebras in a dirt field, distance=579.4217347831027
Out[78]:
(array([[579.38610665, 579.39888576, 579.40344361, 579.40480805,
         579.40940733, 579.41307596, 579.4203796 , 579.42142212,
         579.42151827, 579.42173478]]),
 array([[3771, 7067, 4059, 2087,  137, 4992, 8478, 3978, 4634,  393]],
       dtype=int64))
In [81]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000510182.jpg')
original caption:
['a lady in a blue dress on tennis court bouncing ball.', 'A GIRL IS ON THE COURT PLAYING TENNIS', 'A female tennis player holding a racquet in one hand and throwing a ball with the other', 'A woman with a tennis racket tosses a tennis ball.', 'A woman holding a tennis racquet in her right hand.']
guessed cap: a beautiful young lady holding a tennis racquet on a tennis court, distance=615.7951991867158
guessed cap: a young woman holding a tennis racquet on a tennis court, distance=615.8034918738239
guessed cap: a man holding a tennis racquet on a tennis court, distance=615.8083841054718
guessed cap: a man holding a tennis racquet on a tennis court, distance=615.8083841054718
guessed cap: a man holding a tennis racquet on a tennis court, distance=615.8083841054718
guessed cap: a woman standing on a tennis court holding a racquet, distance=615.8095857778733
guessed cap: a woman in a tennis outfit poses with her racquet, distance=615.8120767414929
guessed cap: a woman holding a tennis racquet on a tennis court, distance=615.8225426030581
guessed cap: a man with a tennis racquet about to swing, distance=615.8248662079233
guessed cap: a young woman holding a tennis racquet in front of a ball, distance=615.8261494432802
Out[81]:
(array([[615.79519919, 615.80349187, 615.80838411, 615.80838411,
         615.80838411, 615.80958578, 615.81207674, 615.8225426 ,
         615.82486621, 615.82614944]]),
 array([[4177, 1567, 4707, 4498, 2353, 2938, 1918, 2409, 3595, 1114]],
       dtype=int64))
In [42]:
get_caption('COCO_val2014_000000019292.jpg')
original caption:
['A group of skiers on a snow covered hill.', 'A group of people skiing on a snow covered summit.', 'a group of people riding skis on a snow covered hill', 'A bunch of people cross country skiing on a large area of white snow.', 'A group of people skiing on the snowy mountain side.']
guessed cap: a couple of skis beneath two ski poles, distance=552.624975427202
guessed cap: a man is water skiing behind a motorboat, distance=552.6304797384897
guessed cap: a bunch of skiers standing on a ski slope, distance=552.6323517037237
guessed cap: a man riding a skateboard at a skate park, distance=552.638147354629
guessed cap: a man riding a skateboard at a skate park, distance=552.638147354629
guessed cap: a young skier is skiing downhill in a competition, distance=552.6403673458632
guessed cap: a person riding skis on a snowy surface, distance=552.642040158661
guessed cap: a person riding skis on a snowy surface, distance=552.642040158661
guessed cap: a person riding a surf board with a paddle, distance=552.6425441643012
guessed cap: a woman riding skis down a ski slope holding ski poles, distance=552.642922635242
Out[42]:
(array([[552.62497543, 552.63047974, 552.6323517 , 552.63814735,
         552.63814735, 552.64036735, 552.64204016, 552.64204016,
         552.64254416, 552.64292264]]),
 array([[6259, 3958, 8902, 8297, 9614, 4809, 2604, 6054, 8536, 5564]],
       dtype=int64))
In [44]:
get_caption('COCO_val2014_000000033499.jpg')
original caption:
['Closeup of a black and gold clock with blue sky in background.', 'a clock on a pole with a sky background ', 'A clock tower reads the time as 3:00.', 'Old clock showing the time against the sky in the afternoon.', 'An ornate outdoor clock with streetlamps above and below it.']
guessed cap: a very tall clock tower sitting along side of a bridge, distance=378.4756092817438
guessed cap: an old building has stone arches and wooden doors, distance=378.4855601665756
guessed cap: a large toy like clock tower in a parking lot, distance=378.50612068471554
guessed cap: tall clock tower with a weather vein on a clear day, distance=378.5164317597138
guessed cap: there is a very tall tower that has a clock on it, distance=378.5798924484081
guessed cap: a large clock on the side of a building with trees below, distance=378.59711891877953
guessed cap: a tall building with a clock on it near a cemetery, distance=378.5979242087232
guessed cap: a row of plants growing in the ground by a building, distance=378.62857771519117
guessed cap: yellow shoes sitting next to a suitcase with red lining, distance=378.63280213898594
guessed cap: a number of luggage bags and boxes on the back of a truck, distance=378.63484489941305
Out[44]:
(array([[378.47560928, 378.48556017, 378.50612068, 378.51643176,
         378.57989245, 378.59711892, 378.59792421, 378.62857772,
         378.63280214, 378.6348449 ]]),
 array([[8401, 5784, 2387, 8588, 4416,  285, 5051, 9279,  887, 9218]],
       dtype=int64))
In [47]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000052005.jpg')
original caption:
['The yellow fire hydrant is next to the bushes.', 'A picture of a fire hydrant next to a plant.', 'The yellow and black fire hydrant has leaves around it.', 'A water hydrant sits in front of foliage.', 'the fire hydrant is yellow and is next to green plants']
guessed cap: the fire hydrant sits in front of the grafitti splattered wall, distance=222.10204211169597
guessed cap: this is a fire hydrant sitting on the sidewalk, distance=222.14276993889936
guessed cap: a picture of a fire hydrant next to a plant, distance=222.14307801587302
guessed cap: a fire hydrant on a sidewalk near a street, distance=222.1452565367234
guessed cap: a fire hydrant near a concrete barricade with trash around it, distance=222.14757529465868
guessed cap: a red fire hydrant next to lampposts on a street, distance=222.14828646500118
guessed cap: a fire hydrant next to a low brick wall, distance=222.15614511678098
guessed cap: a large tree in a yard on a street corner, distance=222.1635678194624
guessed cap: a fire hydrant sits on the side of the road, distance=222.16664686050726
guessed cap: the mirror is beside a red traffic signal, distance=222.17002002445636
Out[47]:
(array([[222.10204211, 222.14276994, 222.14307802, 222.14525654,
         222.14757529, 222.14828647, 222.15614512, 222.16356782,
         222.16664686, 222.17002002]]),
 array([[5969, 7797, 7572, 6627, 9740, 3563, 9696, 9959, 5548, 9967]],
       dtype=int64))
In [49]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000064629.jpg')
original caption:
['A large propeller airplane flying through a blue sky.', 'The airplane is flying through the sky during the day.', 'a four engine propeller jet flies through the sky', 'An airplane in the sky during the day time. ', 'A prop airplane flying on a cloudless day.']
guessed cap: a united jet liner flying through the air, distance=542.9096921246183
guessed cap: a photo of a steam engine train going by, distance=542.9161644381218
guessed cap: a passenger jet with orange and red designs shown flying, distance=542.9197039340654
guessed cap: a large passenger jet flying over the ocean, distance=542.924114758825
guessed cap: looking through the window of showroom at car dealership, distance=542.9252462173572
guessed cap: a public transit train going through a station, distance=542.9264636531454
guessed cap: this is a parking garage with several signs, distance=542.9283095142474
guessed cap: an passenger train travels down the track under power lines, distance=542.928663396462
guessed cap: subway sitting on metal tracks next to walls, distance=542.9334956904521
guessed cap: a fire hydrant sitting next a street pole, distance=542.9354359805939
Out[49]:
(array([[542.90969212, 542.91616444, 542.91970393, 542.92411476,
         542.92524622, 542.92646365, 542.92830951, 542.9286634 ,
         542.93349569, 542.93543598]]),
 array([[4036, 3620, 6375, 4961, 3311, 5162, 2527, 3683, 1837, 1832]],
       dtype=int64))
In [72]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000490081.jpg')
original caption:
['Two white horses pulling a man on a wagon down a dirt road.', 'Man rides on the back of a wooden cart being pulled by two cows. ', 'Two ox are pulling a man on a cart.', 'A man in a wooden ox cart is being pulled.', 'a person riding a carriage being pulled by bulls']
guessed cap: several jockeys are riding horses on the grass, distance=651.3022861249343
guessed cap: a bunch of jockeys riding their horses around a track, distance=651.3223032679166
guessed cap: a lady riding a horse and others horse standing next to it, distance=651.3312563198652
guessed cap: a person on horse with a dog hearding cows in a pasture, distance=651.338778970434
guessed cap: a woman looking at a herd of sheep in a field, distance=651.3524207884657
guessed cap: people pet an elephant at an outdoor zoo exhibit, distance=651.3524701261524
guessed cap: a man and a large herd of goats in a desert setting, distance=651.3584285734846
guessed cap: a jockey riding a horse jumping a hurdle, distance=651.3597431654985
guessed cap: a statue of a man riding a horse sits atop the sculpture in the square, distance=651.3608960361763
guessed cap: a man herding sheep down a busy road, distance=651.3666946524247
Out[72]:
(array([[651.30228612, 651.32230327, 651.33125632, 651.33877897,
         651.35242079, 651.35247013, 651.35842857, 651.35974317,
         651.36089604, 651.36669465]]),
 array([[9912, 5287, 8550,  473, 4654, 8716, 2896, 2960, 2721, 8436]],
       dtype=int64))
In [73]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000495288.jpg')
original caption:
['A blue and green plaid tie with a flag pin on it.', 'A pin is attached to a plaid tie.', 'There is a plaid tie with a pin attached to it.', 'A tie has a flag shaped tie pin.', "A man's navy shirt front with a paid tie and a flag tie clip."]
guessed cap: a woman sitting on a road with a suitcase, distance=201.45053729611038
guessed cap: a woman standing by the road while talking on a cell phone, distance=201.45923529762968
guessed cap: a couple of men standing next to a truck, distance=201.46092565396054
guessed cap: a person wearing a suit and tie near a car, distance=201.47831435729594
guessed cap: a young couple in a car on their wedding day, distance=201.48982657005442
guessed cap: a close up of a child in a car seat with a doughnut, distance=201.49416665864794
guessed cap: this is a little girl standing in an airport, distance=201.50743081562703
guessed cap: a couple of people sitting inside of a car, distance=201.51349174387082
guessed cap: this is an image of a man in uniform on a sidewalk, distance=201.5257282927916
guessed cap: a woman decorating a parking meter with fake flowers, distance=201.52584348434772
Out[73]:
(array([[201.4505373 , 201.4592353 , 201.46092565, 201.47831436,
         201.48982657, 201.49416666, 201.50743082, 201.51349174,
         201.52572829, 201.52584348]]),
 array([[8222,    4, 1682,  764, 9966, 9301, 6196,  151, 7987, 8004]],
       dtype=int64))
In [74]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000497106.jpg')
original caption:
['A young girl in a fairy dress under an umbrella', 'Small girl in a princess outfit standing on the side of a path with an umbrella.', 'A person with an umbrella walking through the grass.', 'A child wearing a dress with wings under an umbrella outdoors.', 'A young girl in a princess costume and an umbrella. ']
guessed cap: a close up of a child in a car seat with a doughnut, distance=279.8373060694917
guessed cap: a female with blue luggage awaits her train, distance=279.882210727376
guessed cap: a person wearing a suit and tie near a car, distance=279.8976688664358
guessed cap: a couple of people sitting inside of a car, distance=279.90285854362395
guessed cap: a young couple in a car on their wedding day, distance=279.91363496690633
guessed cap: elderly couple sitting on black city bench, distance=279.915158277662
guessed cap: a person is sitting in the back of a carriage, distance=279.92270637642014
guessed cap: two women that are sitting on a bus, distance=279.92614654316145
guessed cap: two tourists taking photos of the city with their phones, distance=279.9332244840515
guessed cap: a woman standing by the road while talking on a cell phone, distance=279.9435288223253
Out[74]:
(array([[279.83730607, 279.88221073, 279.89766887, 279.90285854,
         279.91363497, 279.91515828, 279.92270638, 279.92614654,
         279.93322448, 279.94352882]]),
 array([[9301, 3046,  764,  151, 9966, 7867, 9114, 8281, 6802,    4]],
       dtype=int64))
In [79]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000505035.jpg')
original caption:
['A little girl is making a huge mess with a birthday cake. ', 'A baby with a bib eats a cake.', 'A young baby is eating and playing with some cake.', 'A child in a booster chair eating a cake ', 'a small child is sitting on a seat']
guessed cap: a beautiful blonde girl holding a nintendo wii controller next to a man, distance=442.95716316873484
guessed cap: a little girl in a red and white dress on a cellphone, distance=442.9618501383845
guessed cap: a woman holding a smart device in her hands, distance=442.96415026077113
guessed cap: a young girl sits holding a phone looking aghast, distance=442.9673272841423
guessed cap: a little girl talking on a cell phone while sitting down, distance=442.9759439855677
guessed cap: a woman sitting a chair while talking on a cell phone, distance=442.9799520451136
guessed cap: two women having lunch together while man stood by them, distance=442.98252381842207
guessed cap: a little girl holding a cell phone in her hands, distance=442.9858832297699
guessed cap: a man holding his cell phone up in a crowd, distance=442.98667638071555
guessed cap: a girl is playing on a laptop in a computer lab, distance=442.98780264515347
Out[79]:
(array([[442.95716317, 442.96185014, 442.96415026, 442.96732728,
         442.97594399, 442.97995205, 442.98252382, 442.98588323,
         442.98667638, 442.98780265]]),
 array([[3549,  595, 1734, 7615, 1305,  949, 2857, 1548, 5966, 1418]],
       dtype=int64))
In [83]:
# DO NOT CHANGE BELOW CODE
get_caption('COCO_val2014_000000511241.jpg')
original caption:
['Lunch of rice and beans with soup and juice.', 'A dinner table with beverages and dishes of chickpeas, rice, greens, and soup.', 'there are many plates of food on this table', 'A plate loaded with food on a table', 'a table full of various plates of food']
guessed cap: piece of cooked meat with mushrooms with a side of mashed potatoes and broccoli, distance=518.7397293574421
guessed cap: close up view of fried chicken waffles fruit and other food, distance=518.7399570360645
guessed cap: eggs toast and healthy fruits are on a plate, distance=518.7401071510546
guessed cap: a half eaten sandwich next to a bowl of chili and an apple, distance=518.7401075421241
guessed cap: a piece of chicken sitting on a plate with some vegetables, distance=518.7403750561251
guessed cap: a group of uncooked cinnamon rolls on a pan in an oven, distance=518.740624297498
guessed cap: a partially eaten sandwich with steak and onions, distance=518.7408100152691
guessed cap: chicken sandwich with lettuce and tomatoes on a plate with french fries, distance=518.7408452723391
guessed cap: some sort of chicken dish with broccoli spears on side of plate, distance=518.7408722796519
guessed cap: a choice of poached eggs and bacon on a bagel or donuts, distance=518.7410960276941
Out[83]:
(array([[518.73972936, 518.73995704, 518.74010715, 518.74010754,
         518.74037506, 518.7406243 , 518.74081002, 518.74084527,
         518.74087228, 518.74109603]]),
 array([[9450, 4680, 8529, 2067, 2934, 8344, 2968, 3964, 4207, 9475]],
       dtype=int64))

Briefly discuss the result. Why or how it works, and why do you think it does not work at some point.

Answer:

How it works:

First, we build a main training model with three inputs: a VGG neural-code representation of an image (4096 dimensions), an array of integers encoding the word-index sequence of the true caption, and an array of integers encoding a noise caption sequence. We also create two additional models, one for images and one for captions, that reuse layers from the training model. The loss used is the max-margin loss as defined in the notebook, and the accuracy is defined as the fraction of positive pairs that, on average, score higher than their negative pairs.
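As a minimal NumPy sketch (one common form of the hinge-style max-margin loss, assuming scalar pair scores where higher is better; the notebook computes these scores inside the model graph):

```python
import numpy as np

def max_margin_loss(pos_scores, neg_scores, margin=0.2):
    # Hinge loss: penalize whenever a negative pair scores within
    # `margin` of (or above) the corresponding positive pair.
    return np.maximum(0.0, margin - pos_scores + neg_scores).mean()

def pair_accuracy(pos_scores, neg_scores):
    # Fraction of positive pairs that score strictly higher than their negatives.
    return (pos_scores > neg_scores).mean()
```

The `margin` value here is illustrative; the notebook's own hyperparameter may differ.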

The noise captions are the real captions shuffled. We train the model for 100 iterations; in each iteration we run one epoch and reshuffle the captions. For retrieval, we extract the neural codes from the image and caption models respectively. For the image-to-caption task, we read the image, compute its neural codes with the VGG19 network pretrained on ImageNet, and map them to the joint space with the image model. Finally, k-nearest neighbors with L2 distance is used to find the captions closest to the image representation.
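The final nearest-neighbor step can be sketched with scikit-learn's `NearestNeighbors`, the same class used later in `search_image`; here random vectors stand in for the real caption and image neural codes:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
caption_codes = rng.normal(size=(100, 64))  # stand-in for caption embeddings
image_code = rng.normal(size=(1, 64))       # stand-in for one image embedding

neigh = NearestNeighbors(n_neighbors=10, p=2)  # p=2 -> L2 (Euclidean) distance
neigh.fit(caption_codes)
# kneighbors returns (distances, indices), both sorted by increasing distance
distances, indices = neigh.kneighbors(image_code)
```

The indices returned map back into the caption set, which is how the "guessed cap" lists above are produced.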

Result discussion:

The results are decent. Most of the retrieved captions are in the same realm as the original image: when the original caption is about food, the retrieved captions are also food-related. However, the retrieved captions may not get the specific food right. Sometimes, for example in the image of the girl making a mess of her food, the retrieved captions correctly mention a girl but miss the food part.

Why it works:

The network may identify certain features of an image correctly and encode them; captions related to those features are then likely to lie near the query. If an image is closely related to an image from the training set, the resulting captions are fairly accurate.

Why it does not work:

Sometimes an image contains a feature unknown to the network, for example the highchair in the image of the child. The network then identifies the tray as some sort of device; this "device" gets encoded into the vector, and captions about a girl with a device are returned. The failure here lies in unseen features.

8. Image Retrieval

In [30]:
# given a text query, display the retrieved images, their similarity scores,
# and their original captions

def search_image(text_caption, n=10):
    # encode the query as a fixed-length (50) sequence of word indices;
    # out-of-vocabulary words map to index 0
    caption = np.zeros(50, dtype=int)
    for idx, word in enumerate(text_caption.split()):
        if idx < 50:
            try:
                caption[idx] = words_indices[word]
            except KeyError:
                caption[idx] = 0
    # fit a k-NN index (p=2 -> L2 distance) on the image neural codes
    neigh = NearestNeighbors(n_neighbors=n, p=2)
    neigh.fit(nc_img)
    # embed the query caption and retrieve the n nearest images
    rep = cap_model_nc.predict(np.array([caption]))
    nn = neigh.kneighbors(rep)
    for idx, i in enumerate(nn[1][0]):
        print(mapto(i))
        print('Distance = {}'.format(nn[0][0][idx]))

Consider using the following settings for the image retrieval task:

  • use a real caption from the validation set as a query.
  • use part of a caption as a query: instead of the whole sentence, you may use a key phrase or a combination of words from the corresponding caption.
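Either kind of query is encoded the same way `search_image` encodes it: split into words, map each word to its vocabulary index, and zero-pad. A small standalone sketch (the `words_indices` dictionary below is a hypothetical stand-in for the notebook's real vocabulary):

```python
import numpy as np

# hypothetical vocabulary fragment; the real words_indices is built during preprocessing
words_indices = {'two': 5, 'giraffes': 812, 'standing': 44, 'near': 91, 'trees': 130}

def encode_query(text, maxlen=50):
    # map words to indices, unknown words fall back to 0, pad to fixed length
    caption = np.zeros(maxlen, dtype=int)
    for i, word in enumerate(text.split()[:maxlen]):
        caption[i] = words_indices.get(word, 0)
    return caption

q = encode_query('giraffes near trees')  # a partial-caption query
```

This is why a short key phrase works as a query: it simply yields a sequence with more zero padding.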
In [62]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 'two giraffes standing near trees'

# DO NOT CHANGE BELOW CODE
search_image(text1)
None
Distance = 6.027970748372095
None
Distance = 6.154782074328628
None
Distance = 6.171838529693531
None
Distance = 6.1764041776952805
None
Distance = 6.311391903567977
None
Distance = 6.348504052400844
None
Distance = 6.358937059596896
None
Distance = 6.389341586305836
None
Distance = 6.410594533998857
None
Distance = 6.459625998219425
In [70]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 'a clock tower extends into the sky in a cit'

# DO NOT CHANGE BELOW CODE
search_image(text1,n=5)
None
Distance = 6.326488319268053
None
Distance = 6.423302656997315
None
Distance = 6.5338104688767285
None
Distance = 6.688364349023717
None
Distance = 6.934854492825529
In [64]:
# Example of text query 
# text = 'two giraffes standing near trees'

# YOUR QUERY-1
text1 = 'a brown bear handing out of a car with sharp teeth'

# DO NOT CHANGE BELOW CODE
search_image(text1)
None
Distance = 6.274925555692796
None
Distance = 6.3484260062137245
None
Distance = 6.609440878412493
None
Distance = 6.682241121687379
None
Distance = 6.69092619696954
None
Distance = 6.790634324315862
None
Distance = 6.822132412554352
None
Distance = 6.869062669588359
None
Distance = 6.885290361350142
None
Distance = 6.932044591534094
In [37]:
# YOUR QUERY-1
text1 = 'a man flying through the air while riding a snowboard'

# DO NOT CHANGE BELOW CODE
search_image(text1)
None
Distance = 4.484657506545081
None
Distance = 4.520529809905118
None
Distance = 4.618544200323592
None
Distance = 4.628606212770155
None
Distance = 4.654731375039252
None
Distance = 4.674640519861745
None
Distance = 4.6796051651298605
None
Distance = 4.696691984620176
None
Distance = 4.6987521096743645
None
Distance = 4.729200106708689
In [37]:
# YOUR QUERY-2
text2 = 'bicycle'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 8.026987228463918
None
Distance = 8.254930465023522
None
Distance = 8.380506180688496
None
Distance = 8.406467618844795
None
Distance = 8.5651540756869
None
Distance = 8.594451988553846
None
Distance = 8.647437520110218
None
Distance = 8.66930480664991
None
Distance = 8.677923206899823
None
Distance = 8.696367600448479
In [30]:
# YOUR QUERY-2
text2 = 'a man with a tie on and a head band in celebration of a holiday'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 6.607065780819543
None
Distance = 6.662265938232747
None
Distance = 6.72597808719141
None
Distance = 6.8365060326994
None
Distance = 6.862282791558259
None
Distance = 6.995459166571046
None
Distance = 7.14315199287061
None
Distance = 7.152672663421265
None
Distance = 7.228435455708606
None
Distance = 7.28081484914387
In [86]:
# YOUR QUERY-2
text2 = 'small bathroom'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 3.9039987924138777
None
Distance = 3.9884832700930826
None
Distance = 4.0229551168801665
None
Distance = 4.123830354926401
None
Distance = 4.173737736228077
None
Distance = 4.2180315571773965
None
Distance = 4.256208417661942
None
Distance = 4.256826393296377
None
Distance = 4.265413562802005
None
Distance = 4.268016093848858
In [66]:
# YOUR QUERY-2
text2 = 'a man holding an american flag riding down the street on a horse'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 9.703907954288969
None
Distance = 9.836883003520521
None
Distance = 10.351453896151888
None
Distance = 10.363825503307366
None
Distance = 10.753649936912742
None
Distance = 10.755981397697242
None
Distance = 10.838579054964965
None
Distance = 10.893018842456353
None
Distance = 10.896056356681429
None
Distance = 10.935609296401935
In [90]:
# YOUR QUERY-2
text2 = 'sandwich'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 3.8900310515008556
None
Distance = 3.9494303356475755
None
Distance = 3.978968795501242
None
Distance = 4.0516679653475896
None
Distance = 4.090944482683862
None
Distance = 4.128056028457592
None
Distance = 4.129634290762021
None
Distance = 4.144801058917581
None
Distance = 4.145188166391407
None
Distance = 4.186552667590504
In [67]:
# YOUR QUERY-2
text2 = 'a woman in a kitchen holding a carton'

# DO NOT CHANGE BELOW CODE
search_image(text2)
None
Distance = 8.092475378013917
None
Distance = 8.097496875467893
None
Distance = 8.16556991462669
None
Distance = 8.169497130601657
None
Distance = 8.278403824915193
None
Distance = 8.287378540552977
None
Distance = 8.30749583472757
None
Distance = 8.341517039032366
None
Distance = 8.408200628379456
None
Distance = 8.422861397383754

Briefly discuss the result. Why or how it works, and why do you think it does not work at some point.

Answer:

The results for image retrieval are relatively good. The retrieved images are often closely related to the query. For example, when asking for giraffes, most of the images contain a giraffe, even though a few zebras and horses are sometimes returned. When the query concerns visually distinctive objects, such as telling animal species apart, our model retrieves images more accurately, because the features (shape of legs, body, color, etc.) are easily distinguishable.

However, when identifying finer features within a similar object class (for example, the gender of a person), our model still gets it wrong most of the time. For the query 'a woman in a kitchen holding a carton', several pictures of men were retrieved. Identifying gender is apparently not as easy as distinguishing a giraffe from a horse, where size and color differ clearly; between people the differences are much subtler. Another example: queries for a bathroom often retrieve kitchens as well, since both usually have a dominantly white background and some shelves.

This performance can be improved by training for more iterations, so that the training loss becomes as small as possible. When we tried the model trained for 40 iterations, it returned images of trains for the query 'bicycle', which are very different in shape and size. With the 60-iteration model, it returned bicycle and motorcycle images, which are closely related. We believe that with a few hundred more iterations, a better retrieval model can be achieved.